Discussion: Graph 1, Connections Between Institutions

This portfolio looks at the connections between academic institutions on GitHub. The data was extracted from Google’s public data warehouse “github-repos” and pre-processed in Python, whose built-in dictionaries and sets made the pre-processing more efficient to write than in R. The Python code for pre-processing is attached at the end of this report.

The first graph, “Connections Between Institutions”, tries to answer the question of how academic institutions collaborate with each other. To this end, the graph uses a node-link diagram to picture the broad GitHub collaboration landscape: each node is an academic institution, and there is an edge between two institutions if they have contributed to the same repository. Three design choices encode more information: 1) nodes are filled with colors that correlate to the number of contributors: the brighter an institution is, the more contributors it has; 2) edges are drawn in colors that correlate to the number of shared contributors between two institutions: the brighter an edge is, the more shared contributors there are; 3) the graph is drawn with the “kk” (Kamada-Kawai) layout, which tends to place institutions that connect to many other institutions at the center of the graph and leaves those with fewer connections (the less socially active on GitHub) at the periphery.
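The edge definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used to draw the figure: the committer names and pairs are invented, and `institution_edges` is a hypothetical helper that counts committer-level connections between distinct institutions.

```python
from collections import Counter


def institution_edges(pairs):
    """Count committer-level connections between distinct institutions.

    `pairs` holds (src_committer, src_institution, dst_committer,
    dst_institution) tuples, mirroring four columns of data/network.csv.
    """
    counts = Counter()
    for _src, src_inst, _dst, dst_inst in pairs:
        if src_inst != dst_inst:  # skip connections inside one institution
            counts[(src_inst, dst_inst)] += 1
    return counts


# Hypothetical toy data, for illustration only.
toy_pairs = [
    ("alice", "wisc", "bob", "mit"),
    ("alice", "wisc", "carol", "mit"),
    ("dave", "wisc", "erin", "wisc"),
    ("bob", "mit", "alice", "wisc"),
]
edges = institution_edges(toy_pairs)
print(edges[("wisc", "mit")])
```

Each key of the resulting counter corresponds to one edge of the diagram, and its count is the quantity that the edge color encodes.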

With these design choices, the audience can spot the more active collaborators by looking at the center of the graph, tell which institution has more committers by looking at its color, and find all collaborating institutions for a node of interest by following the edges extending out of it.

For example, universities such as Wisc, MIT, Middlebury, UW, and UChicago are among the most collaborative players in the open-source world. On the other hand, universities such as toronto, txstate, and virginia sit at the periphery and collaborate less on GitHub.

UW-Madison (labeled in red) sits almost exactly at the center of the graph, meaning it is one of the most proactive collaborators in the GitHub open-source community. In addition, the color of Wisc is darker, meaning it has relatively few contributors. Together, these observations suggest that contributors at UW-Madison are, on average, considerably more collaborative than those at other institutions.
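The argument in this paragraph compares how connected an institution is with how many contributors it has. One way to make that comparison explicit is a connections-per-contributor ratio. The sketch below uses invented counts and a hypothetical helper, `connections_per_contributor`; none of the numbers come from the report's data.

```python
def connections_per_contributor(num_connections, num_contributors):
    """Return cross-institution connections per contributor (0.0 if nobody)."""
    if num_contributors == 0:
        return 0.0
    return num_connections / num_contributors


# Hypothetical (connections, contributors) counts, for illustration only.
toy = {"small_central": (120, 40), "large_peripheral": (90, 300)}
ratios = {
    name: connections_per_contributor(connections, contributors)
    for name, (connections, contributors) in toy.items()
}
print(ratios)
```

An institution with few contributors but many connections gets a high ratio, which matches the "dark but central" pattern described for UW-Madison.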

library(tidyverse)
library(tidygraph)
library(ggraph)

network <- read_csv("data/network.csv")
## Rows: 139970 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): src_committer, src_institution, dst_committer, dst_institution
## dbl (4): src_num_commits, src_total_commits, dst_num_commits, dst_total_commits
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# prepare data for the connections across institutions:
# count committer-level connections for each pair of distinct institutions
institution_network <- network %>%
  group_by(src_institution, dst_institution) %>%
  summarize(shared_committers = n(), .groups = "drop") %>%
  filter(src_institution != dst_institution)
# load the per-committer data and count the committers in each institution
vertices <- read_csv("data/479_tidy_data.csv") %>%
  group_by(institution) %>%
  summarize(num_committers = n()) %>%
  drop_na() %>%
  rename(name = institution)
## New names:
## * `` -> ...1
## Rows: 108153 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): name, email, institution
## dbl (2): ...1, num_commit
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
edges <- data.frame(
  source = institution_network$src_institution,
  target = institution_network$dst_institution,
  edge_committers = institution_network$shared_committers
)

# only preserve edges whose source institution has at least 4 connections
edges <- edges %>%
  group_by(source) %>%
  mutate(num_connections = n()) %>%
  ungroup() %>%
  filter(num_connections > 3)

# only preserve vertices that are both in source and target
vertices <- vertices %>%
  filter(name %in% edges$source & name %in% edges$target)
# only preserve edges that are in vertices
edges <- edges %>%
  filter(source %in% vertices$name & target %in% vertices$name)

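# create a tbl_graph object
G <- tbl_graph(nodes = vertices, edges = edges, directed = FALSE)

# normalize the raw counts to [0, 1] with min-max scaling
G <- G %>%
  mutate(node_weight = (num_committers - min(num_committers)) /
           (max(num_committers) - min(num_committers))) %>%
  activate(edges) %>%
  mutate(edge_weight = (edge_committers - min(edge_committers)) /
           (max(edge_committers) - min(edge_committers)))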

# visualize the tbl_graph object
ggraph(G, layout = 'kk') + 
  geom_edge_link(aes(col = edge_weight, width=edge_weight), alpha=0.5) +
  scale_edge_width_continuous(range = c(0, 0.5)) +
  geom_node_label(aes(label = name, fill=node_weight, color=ifelse(name == "wisc", "#ff0000", "#ffffff"))) +
  coord_fixed() +
  labs(title = "Connections Between Institutions") + 
  theme_void() +
  theme(plot.title = element_text(size = 20, hjust = 0.5)) + 
  scale_fill_continuous(type = "viridis") +
  scale_edge_colour_viridis() + 
  guides(edge_width = "none") +
  scale_colour_identity()

Discussion: Graph 2, Connections Between Committers Within WISC

The second graph, “Connections Between Committers Within WISC”, zooms in on UW-Madison and uses a node-link diagram to show the collaboration landscape between its committers. In the diagram, each node represents a unique contributor belonging to UW-Madison, and there is an edge between two nodes if they have contributed to the same repository. The design choices mirror the first graph except for two: 1) the weight of an edge is defined by the number of commits the two contributors made to the repository they share; 2) the weight of an edge is shown not only by the darkness of its color but also by its width. These choices are meant to reveal both the most collaborative contributors inside the institution and the institutional “social network” for open-source contributions.
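The edge weight described above is a per-edge quantity: each pair of contributors gets its own commit total, which is then rescaled before being mapped to color and width. A minimal sketch in plain Python, assuming min-max scaling as the normalization and using invented commit counts (`min_max` is a hypothetical helper):

```python
def min_max(values):
    """Rescale numbers linearly to [0, 1]; constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical (src_num_commits, dst_num_commits) per edge, illustration only.
toy_edges = [(12, 30), (50, 11), (200, 150)]

# One weight per edge: the two contributors' commits added together.
edge_commits = [src + dst for src, dst in toy_edges]
edge_weight = min_max(edge_commits)
print(edge_commits, edge_weight)
```

The point of the sketch is that the sum is taken per edge, so edges with more shared commits end up darker and wider, rather than every edge receiving the same grand total.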

An interesting observation from the graph is the big cluster at the center, formed by multiple contributors. An intuitive interpretation is that the cluster corresponds to a large-scale open-source project led by UW-Madison, where each node is a student or faculty member working on the project. After some searching on GitHub, this project appears to be HTCondor (“ht-condor”), the open-source high-throughput computing software developed by UW-Madison’s Center for High Throughput Computing (https://htcondor.org/). This supports the initial hypothesis that a cluster represents a large-scale open-source project, at least for this cluster. Other schools’ open-source projects could potentially be explored in the same way.
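Clusters like this one can also be looked for programmatically rather than by eye. A rough proxy for a visual cluster is a connected component of the committer graph; this is an assumption, since a layout cluster and a connected component are not the same thing. The sketch below uses breadth-first search on invented edges, and `connected_components` is a hypothetical helper.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Group the nodes of an undirected graph into components, largest first."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen, components = set(), []
    for start in neighbors:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            component.add(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Hypothetical committer pairs, for illustration only.
toy = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("x", "y")]
components = connected_components(toy)
print(components[0])
```

The largest component would be the first candidate to check against a known project, in the way the central cluster was matched to HTCondor.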

# keep only the rows where both the src and dst institution are the target institution
target_institution <- "wisc"
target_network <- network %>%
  filter(src_institution == target_institution & dst_institution == target_institution)

# create a tbl_graph object
# src_total_commits repeats on every row of a committer, so take it once instead of summing
vertices <- target_network %>%
  group_by(src_committer) %>%
  summarize(node_commits = max(src_total_commits)) %>%
  rename(name = src_committer)

# edge weight: commits of both committers, added row by row (one value per edge)
edges <- data.frame(
  source = target_network$src_committer,
  target = target_network$dst_committer,
  edge_commits = target_network$src_num_commits + target_network$dst_num_commits
)

G <- tbl_graph(nodes=vertices, edges = edges, directed = FALSE)

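# normalize the raw commit counts to [0, 1] with min-max scaling
G <- G %>%
  mutate(node_weight = (node_commits - min(node_commits)) /
           (max(node_commits) - min(node_commits))) %>%
  activate(edges) %>%
  mutate(edge_weight = (edge_commits - min(edge_commits)) /
           (max(edge_commits) - min(edge_commits)))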

# visualize the tbl_graph object
ggraph(G, layout = 'kk') + 
  geom_edge_link(aes(col = edge_weight, width=edge_weight)) +
  scale_edge_width_continuous(range = c(0, 1.5)) +
  geom_node_label(aes(label = name, fill=node_weight), color="#ffffff") +
  coord_fixed() +
  labs(title = "Connections Between Committers Within WISC") + 
  theme_void() +
  theme(plot.title = element_text(size = 20, hjust = 0.5)) + 
  scale_fill_continuous(type = "viridis") +
  scale_edge_colour_viridis() + 
  guides(edge_width = "none")

# I used the following block of Python code to generate a network of committers from the raw commit data.

import pandas as pd


def get_committer_network(commits, commits_by_committer, allow_loose_connection=False):
    """
    Return the connection between committers
    Two committers are said to be connected if they've contributed to the same repository
    :param allow_loose_connection: False -> use full repo_name; True -> use short repo_name
    """
    # construct a hashmap from committer to number of commits
    commits_by_committer_dict = {}
    for _, row in commits_by_committer.iterrows():
        committer = (row["name"], row["institution"], row["email"])
        commits_by_committer_dict[committer] = row["committer_commit"]

    # construct a hashmap from repositories to committers (key: repo_name, value: set of committers)
    repos_committers = {}
    for _, row in commits.iterrows():
        committer = (row["name"], row["institution"], row["num_commits"], commits_by_committer_dict[(row["name"], row["institution"], row["email"])])
        repo_name = row["repo_name"].split("/")[1] if allow_loose_connection else row["repo_name"]
        if repo_name not in repos_committers:
            repos_committers[repo_name] = set()
        repos_committers[repo_name].add(committer)
        
    # construct network from committers to another committer
    network = []
    for _, row in commits.iterrows():
        committer = (row["name"], row["institution"], row["num_commits"], commits_by_committer_dict[(row["name"], row["institution"], row["email"])])
        repo_name = row["repo_name"].split("/")[1] if allow_loose_connection else row["repo_name"]
        for other_committer in repos_committers[repo_name]:
            if committer != other_committer and committer[2] > 10 and other_committer[2] > 10:
                network.append((committer[0], committer[1], committer[2], committer[3], other_committer[0], other_committer[1], other_committer[2], other_committer[3]))
                
    # remove duplicates
    network = list(set(network)) 
    print('discovered {} connections'.format(len(network)))
    return network

if __name__ == '__main__':
    # read commits from data/commits.csv
    commits = pd.read_csv("data/commits.csv")
    commits_by_committer = pd.read_csv("data/commits_by_committer.csv")
    network = get_committer_network(commits, commits_by_committer)
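    # convert network to a dataframe, naming the columns as the R code expects them
    columns = [
        "src_committer", "src_institution", "src_num_commits", "src_total_commits",
        "dst_committer", "dst_institution", "dst_num_commits", "dst_total_commits",
    ]
    network_df = pd.DataFrame(network, columns=columns)
    # save network to disk
    network_df.to_csv("data/network.csv", index=False)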
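
The `allow_loose_connection` flag in the code above changes which repositories count as "the same". The sketch below isolates that one expression in a hypothetical helper, `repo_key`, and applies it to invented repository names of the form `owner/repo`; it is an illustration of the flag's effect, not part of the original pipeline.

```python
def repo_key(repo_name, allow_loose_connection=False):
    """Return the key used to group committers by repository."""
    # Loose mode drops the owner prefix, so "owner/repo" becomes "repo".
    return repo_name.split("/")[1] if allow_loose_connection else repo_name


# Hypothetical repository names, for illustration only.
first, second = "alice/project", "bob/project"

strict_match = repo_key(first) == repo_key(second)
loose_match = repo_key(first, True) == repo_key(second, True)
print(strict_match, loose_match)
```

With the default strict setting, the two repositories stay separate. With the loose setting, same-named repositories under different owners (forks, for example) are merged, which adds connections but may also link unrelated projects that happen to share a name.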